Simple Power and Sample Size Estimation for Non-Randomized Longitudinal Difference in Differences Studies.

Intervention effects on continuous longitudinal normal outcomes are often estimated in two-arm pre-post interventional studies with b≥1 pre- and k≥1 post-intervention measures using "Difference-in-Differences" (DD) analysis. Although randomization is preferred, non-randomized designs are often necessary due to practical constraints. Power/sample size estimation methods for non-randomized DD designs that incorporate the correlation structure of repeated measures are needed. We derive Generalized Least Squares (GLS) variance estimate of the intervention effect. For the commonly assumed compound symmetry (CS) correlation structure (where the correlation between all repeated measures is a constantρ) this leads to simple power and sample size estimation formulas that can be implemented using pencil and paper. Given a constrained number of total timepoints (T), having as close to possible equal number of pre-and post-intervention timepoints (b=k) achieves greatest power. When planning a study with 7 or less timepoints, given large ρ(ρ≥0.6) in multiple baseline measures (b≥2) or ρ≥0.8 in a single baseline setting, the improvement in power from a randomized versus non-randomized DD design may be minor. Extensions to cluster study designs and incorporation of time invariant covariates are given. Applications to study planning are illustrated using three real examples with T=4 timepoints and ρ ranging from 0.55 to 0.75.


Background
Medical intervention studies of chronic conditions and other ongoing processes often evaluate repeated measures of continuous normal outcomes on persons, facilities or other units at systematic timepoints before and after an intervention is delivered [1]. Some of the This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. units are chosen to receive the [1]. Some of the units are chosen to receive the intervention which starts at the same time in all of those units. While randomization of which units (i.e. either individual or facility) receive the intervention is preferred, it is not always feasible; particularly in health economics and services research. Difference-in-Differences (DD) designs thus often estimate impact of a new intervention or policy introduced at a given timepoint for non-randomized treating facilities (or individuals), compared to controls continuing on the existing regimen or policy [2][3][4]. The outcome being affected by the intervention is measured at b consecutive timepoints (enumerated −b, −(b-1),…. −1) prior to and k. The difference in outcome(s) for the intervention arm between the b pre-and k postintervention periods is compared to the corresponding difference for the control arm. Now DD analysis is best applied using a mixed model framework that adjusts for serial correlation of repeated measures within the same intervention facility or individual [1]. We assume "Non-Randomized" allocation to intervention and control arms being done by convenience or some other process that is not purposely based on levels of the outcome over the first b timepoints. For example, maybe hospitals that are closer to a university are assigned the intervention developed at university. Still the pre-intervention levels of the outcome may differ by an unknown amount between the intervention arms due to the criteria that the circumstance allocation was based on even though the investigator was not deliberately seeking for this to happen. For example, perhaps for historical reasons, the outcome measure tends to be higher at those hospitals that were closer to the university. This pre-existing baseline difference between the arms over the b pre-intervention timepoints will carry through to any post intervention effect during the k post-intervention timepoints.
However, as discussed later if the investigator deliberately over selects individual hospitals specifically based on having higher or lower outcome level then the methods covered here do not apply.
In evaluating intervention effect studies, it is important to estimate whether one has a large enough sample to generate precise results. This is commonly denoted power estimation. For repeated-measure longitudinal studies, the "power" depends on an often-unknown correlation structure between repeated measures of the same unit (which may either be a facility or a person) [5]. Repeated measures within the same unit are typically positively correlated which complicates power estimation as well as statistical analysis compared to the standard setting of independence. While general linear models (GLMs) for both statistical analysis and power estimation are well known [6][7][8] for randomized studies, less power estimation literature exists for non-randomized studies using Difference-in-Differences analysis. This paper develops a generalized least squares (GLS) power estimation framework for non-randomized DD studies using the commonly-assumed compound symmetry (CS) correlation structure of repeated measures that leads to simple power and sample size estimation formulas.
The paper is organized as follows: We first review the standard hypothesis testing and power estimation approach. Next a general linear model of non-randomized pre-post interventional studies with repeated measures using the Difference-in-Differences estimator is presented that develops a standard Generalized Least Squares variance estimate of the intervention effect to be incorporated into the standard power estimation approach. Under the common assumption of compound symmetry repeated-measure correlation, a simple GLS variance formula for intervention effect is derived for non-randomized Difference-in-Differences studies. The influence of number or pre-and post-intervention delivery measures on this variance (and thus efficiency of the study) is evaluated based on this formula. The relative efficiency of non-randomized design is then compared to the randomized setting in terms of needed sample size to achieve the same power. These methods are extended to cluster study designs and incorporation of time invariant covariates. Finally, applications to some recent examples are presented.

General power estimation framework
We consider H o : θ=0 versus H A : θ=θ A 0 where θ A is some expected or hypothesized value for the intervention effect we wish to statistically detect. Without loss of generality, δ = θ A σ is the effect size [9] or θ A expressed as units of standard deviation. For practical repeatedmeasure designs, the normal approximation of the non-central t distribution can be applied [10]. In specific, the two distributions are almost identical when degrees of freedom (DF) γ>30 and we have the following equations of power (1−β) in eqn. (1), in which Var (θ )| expressed as GLS variance estimate.
where α and β are Type I and Type II errors, respectively.
It should be noted here that for smaller sample sizes, it might be appropriate to approximate degrees of freedom in the non-central t distribution for the mixture variance (for example, by Satterthwaite's [11], and Kenward-Roger's approximations [12]) and adjust (1) for this. But while the full details are beyond the scope of this paper, such will typically not be needed in practice. We now proceed to the derivation of Var (θ ) for the Difference-in-Differences design within the General Linear Model Framework.

General linear model (GLM)
For non-randomized pre-post interventional studies with two intervention arms, researchers encounter repeated measures of a quantitative outcome at T=b+k systematic timepoints with b being before and k being after the intervention is delivered to one of the arms. Let h denote the intervention arm with h=0 for control and h=1 for the new intervention. For each arm, there are n h units (n o for the control and n 1 for the new intervention) and j={−b, −(b−1),…, −1, 1, 2,…, k} denotes the ordered times with {−b, −(b−1),…, −1} prior to and {1, 2,…, k} being after the intervention onset. The goal is to assess the impact of the new intervention (versus control) on pre-post change in a longitudinal continuous outcome Y where Y 1ij is measure j from unit i in the new intervention arm and Y 0i', j 'is measure j' from unit i' in the control arm. For example, consider a non-randomized trial with n 0 =n 1 =30 hospitals in each arm. Let i denote hospitals (as "units") where i=1, …,n h . The "units" are measured annually for T=7 years total with b=2 years (2001 to 2002) before and k=5 (2003 to 2007) after the intervention implementation in the intervention arm (h=1). The outcome of interest, Y, could be portion of patients discharged within 30 days after surgery. Thus, Y 1,3,−2 and Y 0,17,3 ', respectively, denote the measurement taken in 2001 (2 years prior to start of the intervention) in the 3 rd hospital of the intervention arm and 2005 (3 years after the start of the intervention) in the 17 th hospital of the control arm, respectively. We assume complete data with T=b+k measures observed on each unit; Y hij can be decomposed as: Now (α) is an intercept, which corresponds to the centrality of the control arm. The fixed effect (γ) is the difference between the main effect of the intervention and control arms due to non-randomized selection as discussed earlier. The fixed time effect (β j ) is modeled to allow for temporal effects at timepoint j that are common to both arms.
We assume an immediate impact of size θ (i.e., as the DD effect) on the outcome variable for the intervention arm after the intervention begins at time j=1 that remains unchanged at subsequent timepoints, which is captured in eqn. (2) by I {h=1,j>0} as the intervention effect (θ) only delivers to the intervention arm (h=1) on the k post-intervention measurements. This is an intervention by time interaction mediated by the intervention effect after the it begins. Note that other functions such as linear intervention effect increase j * θZ hj for j≥1 or immediate post-intervention jump followed by exponential decay e −j *θZ hj for j ≥ 1 are possible. However, there are settings where an immediate "jump effect" that continues forward unchanged is appropriate, such as when the intervention is a process change at a medical facility that can be implemented quickly, a drug that the body does not develop resistance or acclimation to, or an immediately successful behavioral intervention. Even if the intervention impact is not "immediate jump", it could be close to this.
Any random unit (i th level) effects are subsumed into the within-unit error term ε ij, where ε ij~N (0,σ 2 V) with the correlation matrix V defined below in eqn. (3). But related to the second paragraph of the Background section, another important assumption on the error term is that of endogenicity; ε ij must be independent of the covariates in the model [13]. In this case the covariates are the intercept, indicator for assignment to the intervention arm, indicator of timepoint and the intervention by timepoint interactions. After collection of the data (but not during study planning) there are tests for whether endogenicity problems exist in the data [13]. Still, it is difficult to think of when endogenicity would not hold for our setting with one important exception that was noted earlier. If pre-intervention levels of the outcome are used to identify which units receive the intervention, for example, if poorly performing units or alternately those performing well (i.e. at timepoints j<0) are individually over-selected for the intervention, this will likely create endodenocity where ε ij for j<0 is correlated with the intervention arm assignment. Then issues of regression to the mean [14] destabilize the analyses as the ε ij for j<0 are not correlated or are less correlated with intervention arm assignment. Hu and Hoover Page 4 The analyses described here using DD estimators are generally not tenable when selection for intervention arm assignment is deliberately based on observed performance of the unit. It is, however, acceptable if better or poorer performing units are placed into one of the arms by circumstance as long as the selection criteria are based on the strata these units fell into being overall better or poorer performing strata. This selection criteria would be independent of ε ij and would therefore be captured by inclusion of γI {h=1} in eqn. (2).

Generalized least squares (GLS) estimate
The matrix form of eqn. (2) More details on the full expansion of design matrix are in Appendix 1. The covariance matrix V is made up with (n 0 +n 1 ) times block T diagonal matrices V 0 's with all off-block diagonal matrix elements being 0. The error term measures are independent between units, and within-unit correlation structure is invariant between units (i≠i′) for any given two timepoints j and j'(j≠j′), that is, ρ i,jj′ =ρ i′,jj′ . Thus, The within-unit correlation structure (ρ jj′ ) is often unknown in advance. Typically, correlation for any two timepoints would be monotonically non-increasing with |j -j'|, i.e., as the two timepoints are further separated, they will not become more strongly correlated [15][16][17].
The Generalized Least Squares (GLS) estimate for β is β is in eqn. (4), which has proven properties of being the best linear unbiased estimator for β and uniform minimum variance if Y hij is normally distributed [6] is now given Hu and Hoover Page 5 The Generalized Least Squares variance of β is Λ in eqn. (5); a square matrix of order T+1 with the variance of θ the estimated intervention effect being in the last row and last column of Λ.

Results (For Compound Symmetry Correlation) GLS estimate
As previously noted, one main difficulty in parametric analysis of longitudinal data lies in specifying covariance structure [17,18], i.e. estimating ρ jj′ for j ≠ j', as normative data from historical settings often does not exist or is limited. However, compound symmetry structure (V CS ), in which correlations among repeated measures are assumed to be equal within the same unit, is often a reasonable assumption [19][20][21]. For example, V CS is shown below with T=7.
We can then plug in the Var (θ ) in eqn. (6) The next three sections use eqns. (6) and (7) to identify optimal Difference-in-Differences designs, evaluate relative efficiency of non-randomized to randomized designs, and extend to both non-randomized cluster designs and inclusion of additional time invariant covariates into the model. The last section presents applications using derived formulas in three representative examples.

Optimal pre-post intervention allocation of timepoints
The relatively simple form of eqn. (6), simplifies investigation on optimal pre-post intervention allocation in planning non-randomized DD studies. For example, a repeatedmeasure design may have a constrained total number of timepoints T (T=b + k) because of limited budget and/or time. In such scenarios, finding the optimal allocation of T into b and k that maximizes power (or minimizes the sample size needed to obtain a given power) is important. From eqns. (6) and (7), for CS structure with constrained T given ρ, the optimal b * with the local minimization of variance is given when bk=b(T−b) is maximized.
This occurs at b * ; If T is even, then b* = k* = T 2 ; and if T is odd, then equally b* = T − 1 2 or T + 1 2 Figure 1 presents the GLS variance estimates of the intervention effect by the pre-post intervention timepoints allocation and ρ (assumingσ 2 =100) for T=4, T=7, respectively. From the previous discussion, the optimal b * that minimizes the GLS variance in eqn. (6) is b * =2 for T=4, and b * =3 or 4 for T=7. While in common practice, b=1 is chosen to get units shifted onto intervention as soon as possible, delaying this switch by having multiple preintervention timepoints (e.g. b=2 or b=3 when T=7) substantially reduced the GLS variance estimate of the intervention effect.

Comparison between the randomized and non-randomized setting
While it is known that randomization is superior to non-randomization, as randomized studies may be more costly and difficult to conduct, the relative superiority may be important to know. Earlier work [19,22] have shown that for a randomized study conducted using the commonly assumed compound symmetry correlation (as shown in Appendix 2).
To compare eqn. (8) to eqn. (6), we must first address the impact that non-randomization has on ρ and σ 2 . For any given setting, σ 2 will be larger while ρ will be smaller in a randomized design as variance about a common global mean due to randomization will be larger than variance about different intervention arm means in the non-randomized design. Compared to the randomized setting with a common intercept, non-randomized setting will result in a lower within population variance on Y hij where σ 2 = σ NR 2 < σ R 2 , together with a smaller within-unit correlation of Y hij and Y hij where ρ=ρ NR <ρ R , due to elimination of variance ( h ) from the total variance. Definitions of σ NR 2 , σ R 2 , ρ NR, ρ R and σ e randomized setting in that 1 − ρ NR σ NR . This invariance property means that the same effect parameters σ 2 and ρ chosen for a randomized design can be directly used in eqn. (6) to estimate the variance of the intervention effect for the non-randomized designs no matter what the impact non-randomization has on the final σ 2 and ρ is.
To quantitatively measure the difference between randomized and non-randomized studies, we first calculate the ratio of the variance estimate under CS assumption using eqns. (6) and (8) with randomized setting as a reference where as shown above ρ is taken from the randomized setting.
As ρ→1, the ratio goes to 1, meaning the randomized design behaves similar but still better than the non-randomized design when ρ is close to 1. As ρ→0, the ratio reduces to 1 + k b , meaning the non-randomized setting requires 1 + k b times more units than the comparable randomized setting to achieve the same power when ρ is close to 0. Thus, increasing k or decreasing b (with all other parameters fixed) can lead to more advantages in conducting randomization. For b≥k, the ratio lies within (1, 2); for very small k b , the ratio is close to 1, meaning that randomization does not qualitatively reduce the GLS variance estimate of the intervention effect. T=4, for ρ≥0.6 and for b≥2, non-randomization performed close to randomization as the variance ratio was less than 1.14. But for b=1, variance from non-randomization did not approach that from randomization until ρ≥0.8 where the variance ratio was 1.17. Similarly, when T=7, non-randomization performed close to randomization as the variance ratio was less than 1.22 (for ρ≥0.6 and b≥2), while variance from non-randomization did not approach that from randomization until ρ≥0.8 for b=1 where the variance ratio was 1.21.

Extension to cluster designs
The cluster-randomized trial, with a randomly chosen subset of communities or other units being longitudinally followed that switched into a new intervention at the same timepoint [23], is similar to the pre-post interventional study we have discussed above. However, it differs in that the outcome of the cluster design is not measured on the entire unit, which instead is taken as an average of the outcomes of m randomly chosen participants at each new timepoint [23]. For example, m=50 new participants at the same community are randomly chosen at each timepoint of a smoking cessation intervention study (where all participants at a community receive the same intervention) and the outcome is the average number of cigarettes smoked among these 50 participants. While typically units are randomized into such pre-post intervention studies where b=0, k=1, repeated-measure designs (b>0, k>0) with non-randomly chosen units are possible, where m different participants being randomly selected within each of all T timepoints. Now the outcome is Y i j an average of m independent participants in i th cluster at time j. We should caution the readers that our notation for the covariance ρ differs from that used in such cluster designs which we denote as ρ as the variance of the Y i j depends on m. To convert the within-unit repeated-measure correlation ρ used in those papers to our ρ defined above, we implement . Therefore, in a cluster design with m=50 participants per cluster where the within-cluster correlation of the outcome is p = 0.02, then the within-unit correlation of the outcome Y i j from different timepoints (j and j') can be derived using the above formula where ρ =

Inclusion of time-invariant covariates
We extend the GLS variance estimate in eqns. (6) and (8) by adding time invariant baseline covariates W 1 ,…,W Q (either policy-level or unit-level covariates) to eqn. (2), which can control for compositional changes and further improve the accuracy by decreasing the unmodeled variance. We again assume endogenicity that the added covariates are not correlated with the error term [13,24]. While this assumption must always be evaluated for any given setting, it is difficult to see how external baseline covariates could be correlated with ε ij except perhaps through interaction with the previously described over selection of poorly (or well) performing units at j−1 to be put into the intervention arm. We noted before, that after collection of the data there are tests for whether endogenicity problems exist [13]. As derived in Appendix 4, if endogneicity problems do not exist, then after inclusion of time invariant covariate into the model.
where R 2 is the multiple correlation coefficient between W 1 ,…,W Q and Y while ρ Q and σ Q 2 are the within-unit correlations and variances of Y after adjusting for baseline covariates W 1 , …, W q and σ Q 2 = 1 − R 2 σ 2 where σ 2 is what the variance would have been without the additional covariates.

Application to representative examples
We selected three representative examples from available data and literature to evaluate potential repeated-measure correlation studies; i) depression measured by the Center for Epidemiologic Studies Score for Depression (CESD) in women seen every 6 months [25] on which we observed ρ=0.55, ii) cardiovascular and other clinical measures of regularly monitored patients for which ~ 0.65 was observed [19], and iii) a longitudinal study of an intervention of community-based sanitation in Indian villages for which ρ=0.75 was noted [26].
Let's assume that for one of these settings, a non-randomized longitudinal study with two intervention arms is planned, for T=4. The goal of interest is to calculate the number of units needed to obtain a specified study size and power. In many settings, the intervention would be implemented after one baseline visit (i.e., b=1, k=3). But from previous discussion of optimal pre-post allocation, the GLS variance of the intervention effect in the nonrandomized design (the main focus of this paper) is minimized with b=k=3. In some settings, b=3, k=1 might be used. Table 1 thus provides the needed sample size (units for both arms) to detect effect size δ=0.25 and δ=0.50 as calculated using eqn. (7) at 80% power for two-sided hypothesis testing at 0.05 significance-level. This is done for (b,k)=(1,3), (2,2) and (3,1) for both non-randomized and randomized setting. For simplicity (and to optimized power), n 0 =n 1 =n is assumed.
By symmetry, the total number of units needed for (b,k)=(1,3) was the same as for (b,k)=(3,1) in the non-randomized design. However, for the randomized design, the total number of units needed was much larger for (b,k)=(3,1) than for (b,k)= (1,3). Thus, the advantage of randomization over non-randomized was small when (b,k)=(3,1), ranging from no reduction in needed units when ρ=0.75 and δ=0.50, to a reduction of 8 units (~5%) from 152 to 144 for both arms when ρ=0.55 and δ=0.25.

Conclusion
The aim of this paper was to develop a power and sample size estimation framework for non-randomized two-arm pre-post interventional studies with repeated continuous longitudinal outcomes using Difference-in-Differences analysis. We presented generalized least squares variance estimates of intervention effect in linear models assuming a jump effect on the outcome immediately after intervention.
An easily implemented formula for variance estimate of intervention effect was derived under the commonly-assumed within-unit compound symmetry correlation among repeated measures. Not surprisingly, the variance decreases as the number of total timepoints (T) increases. However, this must be weighed against the extra cost associated with more follow-up timepoints. For non-randomized DD studies with a constrained T, equal number of pre-and post-intervention timepoints can achieve the greatest power by minimizing the GLS variance of intervention effect.
For power analysis in planning intervention studies with pre-post intervention outcomes and T ranging from 4 to 7, although randomization is always preferred, non-randomization can work nearly as well for high within-unit repeated-measure correlation (ρ≥0.60) in multiple baseline designs where b≥2, while for single baseline designs where b=1, researchers should be more cautious about choosing non-randomization unless a higher correlation (b=1, ρ ≥ 0.80) exists.
To extend GLS variance formulas to studies conducted with a cluster design approach with the outcome Y i j , being an average of m different participants randomly chosen from the cluster we convert the ρ used for the "within-cluster correlation" in those papers to our within-unit repeated-measure correlation ρ using ρ = ρ ρ + 1 − ρ m . We further extended GLS variance formulas by incorporating time-invariant covariates, that can reduce the variance of the intervention effect estimate by (1−R 2 ) where R 2 is the multiple correlation coefficient between those covariates and the outcome.
Several limitations should be mentioned. For simplicity, we focused on designs with no missing data, although such will likely be the case when units are facilities with the outcome data collected by ongoing periodic quality control monitoring even in the absence of a DD study. We assumed an immediate one-time jump effect of the intervention. While the effect may have accruing cumulative or some other patterns in some settings, it may still be very close to an immediate jump. Although compound symmetry correlation is often assumed when planning a study, it may not always hold in practice as covariance could change over time from uncontrollable mechanisms. Relaxation of the above assumptions may likely lead to complicated settings that could be addressed with simulation.
In conclusion, this paper developed a generalized least squares power estimation framework based on compound symmetry correlation that resulted in simple GLS variance formulas of the intervention effect for non-randomized Difference-in-Differences studies which could be implemented with pencil and paper. We investigated the optimal pre-post intervention allocation of timepoints in planning non-randomized longitudinal Difference-in-Differences studies. While randomization is always preferred to reduce the variance estimate of the intervention effect, non-randomization performs relatively well (for T≤7 timepoints) when high within-unit repeated-measure correlation holds particularly if there is a large number of pre-intervention relative to post-intervention timepoints. The formulas easily extend to cluster study designs and adjust for time invariant variables.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
Hu and Hoover Page 16